Abstract

In this paper, the red-blue pebble game is proposed to model the input-output complexity of algorithms.
Using the pebble game formulation, a number of lower bound results for the I/O requirement are proven.
For example, it is shown that to perform the n-point FFT (or the ordinary n×n matrix multiplication
algorithm) with a device of O(S) memory, at least Ω(n log n / log S) (or Ω(n^3/√S), respectively) time is
needed for the I/O. Similar results are obtained for algorithms for several other problems. All of the lower
bounds presented are the best possible in the sense that they are achievable by certain decomposition
schemes.

The results in this paper provide insight into the difficult task of balancing I/O and computation in
special-purpose system design. For example, for the n-point FFT, the I/O lower bound implies that an S-point
device achieving a speed-up ratio O(log S) over the conventional O(n log n) implementation is all that one can
hope for.
1. Introduction

When  a  large  computation  is  performed  on  a  small  device  or  memory,  the  computation  must  be 
decomposed  into  subcomputations.  Executing  subcomputations  one  at  a  time  may  require  a  substantial 
amount  of  I/O  to  store  or  retrieve  intermediate  results.  Very  often  it  is  the  I/O  that  dominates  the  speed  of  a 
computation.  In  fact,  I/O  is  a  typical  bottleneck  for  performance  at  all  levels  of  a  computer  system.  However, 
to  the  authors’  knowledge  the  I/O  problem  was  not  previously  modelled  or  studied  in  any  systematic  or 
abstract manner. Similar problems were studied only in a few isolated instances [2, 5]. This paper proposes a
pebble game, called the red-blue pebble game, to model the problem, and presents a number of lower bound
results for the I/O requirement. All the lower bounds presented can be shown to be the best possible, in the
sense that they are achieved by certain decomposition schemes. The paper is organized according to the
techniques  used  to  derive  these  lower  bounds. 

In  Section  2  we  formally  define  the  pebble  game  and  point  out  its  relation  to  the  I/O  problem.  In  Section  3 
we  show  that  lower  bounds  for  I/O  in  the  pebble  game  can  be  established  by  studying  the  so-called 
S-partitioning problem. This is the key result of the paper in the sense that it provides the basis for the
derivation of all the lower bounds. In Section 4 we prove a lower bound for the FFT algorithm. Lower
bounds in Section 5 are based on the information speed function, which measures how fast the number of
vertices on which a given vertex "depends" can grow in a directed acyclic graph of a certain type. We
demonstrate the dramatic difference between the I/O requirement for the odd-even transposition sorting
network and that for the "snake-like" mesh graph. In contrast to the focus of Section 5, Section 6 studies
independent computations for which there is very little information exchange among vertices. There we
obtain,  for  example,  a  lower  bound  for  the  ordinary  matrix  multiplication  algorithm.  In  Section  7  we  prove  a 
general  theorem  on  products  of  graphs.  Using  this  theorem,  one  can  determine  the  I/O  required  by  a  product 
of  graphs,  by  examining  only  the  individual  graphs.  A  summary  and  concluding  remarks  are  provided  in 
Section  8. 

Results of this paper impose upper bounds on the maximum possible speed-up obtainable with a
special-purpose hardware device when the bandwidth of the memory that supplies data to the device remains
constant. For example, our lower bound on the I/O requirement for the n-point FFT (Corollary 4.1) implies
that an S-point device can achieve a speed-up ratio of at most O(log S) over the conventional O(n log n)
software implementation. Similarly, for matrix multiplication our result (Corollary 6.2) implies that a
√S × √S device can achieve a speed-up ratio of at most O(√S).
2. The Red-Blue Pebble Game and Its Relation to the I/O Problem

Like the usual pebble game (see, e.g., [4]), the red-blue pebble game is played on a directed acyclic graph.¹
At any point in the pebble game, some vertices of the graph will have red pebbles, some will have blue
pebbles, some will have both red and blue pebbles and the remainder will have no pebbles at all. Following
the notation of Pippenger [8], define a configuration as a pair of subsets of the vertices, one comprised of just
the  vertices  having  red  pebbles,  and  the  other  just  those  having  blue  pebbles.  Thus  vertices  belonging  to  the 
intersection  of  the  two  sets  have  both  red  and  blue  pebbles  on  them.  The  set  of  inputs  (or  outputs)  of  the 
graph  is  some  designated  set  of  vertices  containing  at  least  those  vertices  that  have  no  predecessors  (or 
successors,  respectively).  We  assume  that  the  set  of  inputs  is  disjoint  from  that  of  outputs.  For  all  the 
examples  discussed  in  the  paper,  only  vertices  that  have  no  predecessors  (or  successors)  are  assumed  to  be 
inputs  (or  outputs,  respectively),  except  in  Section  7  where  products  of  graphs  are  considered.  The  initial  (or 
terminal)  configuration  is  one  in  which  only  inputs  (or  outputs,  respectively)  have  pebbles,  and  they  are  all 
blue  pebbles.  The  rules  of  the  red-blue  pebble  game  are  as  follows. 

R1. (Input) A red pebble may be placed on any vertex that has a blue pebble.

R2.  (Output)  A  blue  pebble  may  be  placed  on  any  vertex  that  has  a  red  pebble. 

R3.  (Compute)  If  all  the  immediate  predecessors  of  a  vertex  have  red  pebbles,  a  red  pebble  may  be 
placed  on  that  vertex. 

R4.  (Delete)  A  pebble  (red  or  blue)  may  be  removed  from  any  vertex. 

A  transition  is  an  ordered  pair  of  configurations,  the  second  of  which  follows  from  the  first  according  to  one 
of  the  rules.  A  calculation  is  a  sequence  of  configurations,  each  successive  pair  of  which  form  a  transition.  A 
complete  calculation  is  one  that  begins  with  the  initial  configuration  and  ends  with  the  terminal  configuration. 

A graph on which the red-blue pebble game is played can model a computation performed on a two-level
memory structure, consisting of, say, a fast memory and a slow memory. Vertices represent operations and
their results. An edge from one vertex to another indicates that the result of one operation is an operand of
the other. An operation can be performed only if all the operands reside in the fast memory. Placing a red
pebble using rule R3 corresponds to performing an operation and storing the result in the fast memory.
Placing a blue pebble using rule R2 corresponds to storing a copy of a result (currently in the fast memory)
into the slow memory, whereas placing a red pebble using R1 corresponds to retrieving a copy of a result
(currently in the slow memory) into the fast memory. Removing a red or blue pebble using rule R4 means
freeing a memory location in the fast or slow memory, respectively. The maximum allowable number of red
or blue pebbles on the graph at any point in the game corresponds to the number of words available for use in
the fast or slow memory, respectively.

¹ The red-blue pebble game discussed in this paper is not related in any way to the black-and-white pebble
game introduced by Cook and Sethi [1].


For  the  purpose  of  this  paper,  we  assume  that  the  fast  memory  can  hold  only  S  words,  where  S  is  a 
constant,  while  the  slow  memory  is  arbitrarily  large.  Thus  when  the  pebble  game  is  played  on  a  graph,  at 
most  S  red  pebbles,  and  any  number  of  blue  pebbles,  can  be  on  the  graph  at  any  time.  For  any  given  graph, 
we  are  interested  in  the  minimum  I/O  time  Q,  which  is  defined  by 

Q = the minimum number of transitions according to rule R1 or R2 required by any complete
calculation.
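
The rules and the definition of Q can be made concrete with a small simulation. The following is a minimal sketch, not part of the original paper: the function `play`, the rule labels `R4r` and `R4b`, and the seven-vertex summation tree are hypothetical names chosen for illustration. It replays a given sequence of moves, enforces the limit of S red pebbles, and counts the transitions that use rule R1 or R2. The final R4 deletions needed to reach the terminal configuration are omitted because they do not affect the count. The numbers returned are the I/O times of the two particular calculations shown, so they are only upper bounds on Q for this graph.

```python
def play(preds, inputs, S, steps):
    """Replay (rule, vertex) moves; return the number of R1/R2 transitions."""
    red, blue, io = set(), set(inputs), 0   # initial configuration: blue on inputs
    for rule, v in steps:
        if rule == "R1":        # input: a vertex with a blue pebble gets a red one
            assert v in blue
            red.add(v)
            io += 1
        elif rule == "R2":      # output: a vertex with a red pebble gets a blue one
            assert v in red
            blue.add(v)
            io += 1
        elif rule == "R3":      # compute: all immediate predecessors are red
            assert preds[v] and all(p in red for p in preds[v])
            red.add(v)
        elif rule == "R4r":     # delete a red pebble
            red.discard(v)
        elif rule == "R4b":     # delete a blue pebble
            blue.discard(v)
        else:
            raise ValueError(f"unknown rule {rule!r}")
        assert len(red) <= S, "more than S red pebbles on the graph"
    return io


# Hypothetical example graph: e = a + b, f = c + d, g = e + f.
preds = {"a": [], "b": [], "c": [], "d": [],
         "e": ["a", "b"], "f": ["c", "d"], "g": ["e", "f"]}
inputs = {"a", "b", "c", "d"}

# With S = 4 the intermediate result e can stay in fast memory.
schedule_s4 = [("R1", "a"), ("R1", "b"), ("R3", "e"), ("R4r", "a"), ("R4r", "b"),
               ("R1", "c"), ("R1", "d"), ("R3", "f"), ("R4r", "c"), ("R4r", "d"),
               ("R3", "g"), ("R2", "g")]

# With S = 3 the result e must be written out and read back.
schedule_s3 = [("R1", "a"), ("R1", "b"), ("R3", "e"), ("R2", "e"),
               ("R4r", "a"), ("R4r", "b"), ("R4r", "e"),
               ("R1", "c"), ("R1", "d"), ("R3", "f"), ("R4r", "c"), ("R4r", "d"),
               ("R1", "e"), ("R3", "g"), ("R2", "g")]

print(play(preds, inputs, 4, schedule_s4))  # 5
print(play(preds, inputs, 3, schedule_s3))  # 7
```

In this toy case the schedule with four red pebbles uses 5 I/O transitions, while the schedule with three red pebbles has to store and reload e and uses 7.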


that it does not apply rules on-the-fly as in a game. We show that lower bounds for the S-partitioning
problem can be translated into lower bounds on Q for the red-blue pebble game.

3.  The  S-Partitioning  Problem  and  the  Key  Lemma 

In this section we show that every complete calculation of the red-blue pebble game on a directed acyclic
graph defines a partition of the graph. Let G = (V, E) be a directed acyclic graph where V and E are the
vertex and edge sets of G, respectively. A family of subsets of V, {V_1, V_2, ..., V_h}, is called an S-partition of
G for some positive integer S if the following four properties hold.

P1. The V_i's are disjoint and ∪_{i=1}^{h} V_i = V.

P2. For each V_i, 1 ≤ i ≤ h, there exists a dominator set D_i for V_i that contains at most S vertices. (A
dominator set for V_i is defined to be a set of vertices in V such that every path from an input of G
to a vertex in V_i contains some vertex in the set.)

P3. For each V_i, 1 ≤ i ≤ h, the minimum set M_i of V_i has at most S vertices. (The minimum set of V_i is
defined to be the set of those vertices in V_i that do not have any sons belonging to V_i.)

P4. There is no cyclic dependence among vertex sets in {V_1, V_2, ..., V_h}. (A vertex set V_i is said to
depend on another vertex set V_j if there is an edge in E from a vertex in V_j to a vertex in V_i.)
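
Properties P1 to P4 can be checked mechanically once candidate dominator sets are supplied. The sketch below is an illustration under stated assumptions, not a procedure from the paper: `dominates`, `minimum_set` and `is_s_partition` are hypothetical helper names, the graph is given as a successor dictionary, and the dominator sets D_i are passed in rather than searched for (finding a smallest dominator set is a separate problem that this sketch does not attempt).

```python
from collections import deque


def dominates(succ, inputs, D, Vi):
    """P2 test: every path from an input to a vertex of Vi meets D."""
    seen = {v for v in inputs if v not in D}     # inputs not blocked by D
    queue = deque(seen)
    while queue:                                 # search that never enters D
        u = queue.popleft()
        for w in succ.get(u, ()):
            if w not in D and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen.isdisjoint(Vi)                   # no vertex of Vi reached


def minimum_set(succ, Vi):
    """Vertices of Vi that have no son inside Vi."""
    return {v for v in Vi if not any(w in Vi for w in succ.get(v, ()))}


def is_s_partition(succ, vertices, inputs, parts, doms, S):
    # P1: the parts are disjoint and cover V.
    union = set().union(*parts)
    if sum(len(p) for p in parts) != len(union) or union != set(vertices):
        return False
    # P2: each supplied D_i is small enough and dominates V_i.
    for Vi, Di in zip(parts, doms):
        if len(Di) > S or not dominates(succ, inputs, Di, Vi):
            return False
    # P3: each minimum set has at most S vertices.
    if any(len(minimum_set(succ, Vi)) > S for Vi in parts):
        return False
    # P4: the dependence relation among parts is acyclic (topological peeling).
    owner = {v: i for i, Vi in enumerate(parts) for v in Vi}
    out = {i: set() for i in range(len(parts))}
    for u in succ:
        for w in succ[u]:
            if owner[u] != owner[w]:
                out[owner[u]].add(owner[w])      # part of w depends on part of u
    indeg = {i: 0 for i in out}
    for i in out:
        for j in out[i]:
            indeg[j] += 1
    ready = [i for i in out if indeg[i] == 0]
    done = 0
    while ready:
        i = ready.pop()
        done += 1
        for j in out[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return done == len(parts)


# Hypothetical example: summation tree e = a + b, f = c + d, g = e + f.
succ = {"a": ["e"], "b": ["e"], "c": ["f"], "d": ["f"],
        "e": ["g"], "f": ["g"], "g": []}
inputs = {"a", "b", "c", "d"}
parts = [{"a", "b", "e"}, {"c", "d", "f"}, {"g"}]
doms = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
print(is_s_partition(succ, succ.keys(), inputs, parts, doms, 2))  # True
```

Passing the dominator sets explicitly keeps the check simple: the function only certifies a proposed S-partition and says nothing about the minimum number of parts.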

Theorem 3.1. Let G = (V, E) be a directed acyclic graph. Any complete calculation of the
red-blue pebble game on G, using at most S red pebbles, is associated with a 2S-partition of G such
that

S·h ≥ q ≥ S·(h − 1),

where q is the I/O time required by the complete calculation, and h is the number of vertex sets in
the 2S-partition.

Proof: Denote by C any complete calculation. We can divide C into a sequence of h
consecutive subcalculations, C_1, C_2, ..., C_h, for some h such that in each C_i, 1 ≤ i ≤ h−1, there are
exactly S transitions using rule R1 or R2, and in C_h there are no more than S such transitions. For
i = 1, ..., h, define V_i to be the largest vertex set in which each vertex satisfies the following three
properties.

(i) During subcalculation C_i it has a red pebble placed on it using rule R1 or R3.

(ii) At the end of subcalculation C_i, it either has red pebbles, or blue pebbles that are placed on
it during C_i, or has a son in V_i.

(iii) It does not belong to any V_j with j < i.

We claim that the family {V_1, V_2, ..., V_h} is a 2S-partition of G. First we show that property P1
holds. By (iii) it follows immediately that the V_i's are disjoint. In the following we show that every
vertex in V belongs to some V_i. Suppose that a vertex, which is not an input, has a red or blue
pebble on it at the end of some subcalculation C_i. Then there must exist a subcalculation C_j, j ≤ i,
during which the vertex has a red pebble placed on it using rule R3, and at the end of C_j it either
remains to have the red pebble or has a blue pebble that is placed on it during C_j. This implies
that the vertex belongs to V_k for some k ≤ j. Similarly one can show that if an input has a red
pebble on it at the end of C_i, then it must belong to V_k for some k ≤ i. Because calculation C is a
complete calculation, all outputs have blue pebbles on them at the end of the last subcalculation
C_h; thus they all belong to ∪_{i=1}^{h} V_i. Consider now any immediate predecessor u of an output v.
Suppose that v belongs to V_i. Then v cannot have any pebble on it at the beginning of C_i and thus
must have a red pebble placed on it using R3 during C_i. This implies that we have one of the
following two cases:

Case 1: Vertex u has a red pebble on it at the end of subcalculation C_{i−1}. Then by
reasons stated above, u belongs to some V_j, j ≤ i−1.

Case 2: Vertex u has a red pebble placed on it using rule R1 or R3 during C_i. If u does
not belong to any V_j with j < i, then because u has a son v in V_i, u itself must belong to
V_i.

We have shown that all the immediate predecessors of outputs belong to ∪_{i=1}^{h} V_i. Similarly, we can
show that all the immediate predecessors of the immediate predecessors of outputs belong to
∪_{i=1}^{h} V_i. Property P1 follows by induction. Note that both Case 1 and Case 2 above imply that if
V_i depends on V_j then j ≤ i. Therefore there cannot be any cyclic dependence among V_i's, and thus
property P4 holds. For proving property P2 for any V_i, 1 ≤ i ≤ h, we consider two subsets of V,
V_R and V_BR, which are defined as follows.

• V_R consists of those vertices that have red pebbles placed on them just before
subcalculation C_i begins.

• V_BR consists of those vertices that have blue pebbles placed on them just before
subcalculation C_i begins and have red pebbles placed on them according to rule R1 during
C_i.

It is easy to see that by property (i) in the definition of V_i, V_R ∪ V_BR forms a dominator set for V_i.
Since there can be at most S red pebbles on G at any time, we have

|V_R| ≤ S.

The fact that at most S transitions can use rule R1 during C_i implies that

|V_BR| ≤ S.

Thus

|V_R ∪ V_BR| ≤ |V_R| + |V_BR| ≤ 2S.

We have shown that {V_1, V_2, ..., V_h} satisfies property P2. The proof of property P3 is similar.
By property (ii) in the definition of V_i, we know that at the end of subcalculation C_i, every vertex
in M_i, the minimum set of V_i, has red pebbles, or blue pebbles that are placed on it during C_i.
Since there can be at most S vertices having red pebbles placed on them at any time, and at most S
vertices having blue pebbles placed on them according to rule R2 during C_i, the minimum set M_i
can have at most 2S vertices. We have shown that {V_1, V_2, ..., V_h} is a 2S-partition of G. The
theorem follows by noting that corresponding to each V_i, 1 ≤ i ≤ h−1, exactly S transitions using
R1 or R2 are performed and to V_h, no more than S such transitions are performed. □


Let

P(S) = the minimum number of vertex sets that any S-partition of G must have.

We have, by Theorem 3.1, the key lemma of the paper.

Lemma 3.1. For any directed acyclic graph G, the minimum I/O time satisfies

Q ≥ S·(P(2S) − 1).

Using  this  lemma,  lower  bounds  for  P  can  be  translated  immediately  into  lower  bounds  for  Q. 


4.  Lower  Bounds  for  the  FFT  Computation 

In this section we establish a lower bound on the minimum I/O time Q for the n-point FFT graph (see
Figure 2-1(a)), by proving a lower bound on P.


Define an S-dominator partition of a graph G = (V, E) to be a family of subsets of V, {V_1, V_2, ..., V_h},
satisfying properties P1, P2 and P4 of an S-partition, but not necessarily property P3. Let

P_D(S) = the minimum number of vertex sets that any S-dominator partition of G must have.

Then clearly P_D(S) ≤ P(S), since any S-partition is also an S-dominator partition. The following theorem
establishes a lower bound on P_D(S), and thus a lower bound on P(S).

Theorem 4.1. Suppose that S ≥ 2. The minimum number of vertex sets that any S-dominator
partition of the n-point FFT graph must have satisfies

P_D(S) = Ω((n log n)/(S log S)).

Proof: Since there are a total of Θ(n log n) vertices in the n-point FFT graph, it suffices to
prove that any vertex set U that has a dominator set of size no more than S, S < n, can have at
most S log S + S vertices. We shall show this by induction on n. The assertion holds trivially for
the case when n = 2. Assume now that it holds for the m-point FFT for any m < n. We want to
show that it holds for m = n. Consider the n-point FFT graph. We partition its vertex set into four
disjoint sets A, B, C and D such that sets C and D equally partition the set of outputs, and sets A
and B equally separate the rest of the vertices. See Figure 4-1 below. Let d_A, d_B, d_C or d_D be the
number of elements in the dominator set that belong to sets A, B, C or D, respectively. Let u_A, u_B,
u_C or u_D be the number of elements in the vertex set U that belong to sets A, B, C or D,
respectively. It is easy to see that elements in set A that are also in the dominator set form a
dominator set for set U ∩ A. Thus by the induction hypothesis,

u_A ≤ d_A log d_A + d_A. (1)

Similarly, we have

u_B ≤ d_B log d_B + d_B. (2)

Let R_A or R_B be the set of those horizontal paths from inputs to outputs on which there are
vertices in the dominator set that belong to A or B, respectively. Then

|R_A| ≤ d_A and |R_B| ≤ d_B. (3)

For each vertex in U ∩ C or U ∩ D, it either belongs to the dominator set or one of its immediate
predecessors is on a horizontal path belonging to R_A and the other on one belonging to R_B.
Therefore

u_C ≤ d_C + min(|R_A|, |R_B|) and u_D ≤ d_D + min(|R_A|, |R_B|),
from which we have

u_C + u_D ≤ d_C + d_D + 2 min(|R_A|, |R_B|). (4)

By (1), (2), (3) and (4),

u_A + u_B + u_C + u_D ≤ [d_A log d_A + d_B log d_B + 2 min(d_A, d_B)] + d_A + d_B + d_C + d_D.

The bracketed sum is at most (d_A + d_B) log (d_A + d_B), with logarithms taken to base 2, and
d_A + d_B + d_C + d_D ≤ S. Since d_A + d_B ≤ S − d_C − d_D, we have

u_A + u_B + u_C + u_D ≤ (S − d_C − d_D) log (S − d_C − d_D) + S ≤ S log S + S,

which completes the induction proof. □

By Lemma 3.1 we have the following lower bound result.

Corollary 4.1. For the n-point FFT graph,

Q · log S = Ω(n log n).

Thus the I/O time for the n-point FFT when executed on a special-purpose device with S words of memory is
at least Ω(n log n / log S), implying that the maximum possible speed-up ratio over the usual O(n log n)
implementation is at most O(log S). This upper bound on the speed-up ratio holds no matter how fast the
device may be, since it is a consequence of the I/O consideration. The upper bound can be reduced only if
the bandwidth of the memory that supplies data to the special-purpose device is increased. A systolic device
that distributes S words of memory in a linear processor array and achieves Θ(log S) speed-up for the FFT is
described by Kung [7].
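
To get a feel for the magnitudes involved, the bound can be evaluated numerically. The sketch below is illustrative only. It assumes that the n-point FFT graph has log2(n) + 1 levels of n vertices each (n a power of two), takes logarithms to base 2, and combines the per-set bound S log S + S from the proof of Theorem 4.1 with Lemma 3.1. The function names are hypothetical, and the explicit constants it produces are not claimed by the paper, which states only asymptotic bounds.

```python
import math


def fft_vertices(n):
    """Vertex count of the assumed FFT graph: log2(n) + 1 levels of n vertices."""
    return n * (int(math.log2(n)) + 1)


def max_vertices_per_set(S):
    """Per-set bound from the proof of Theorem 4.1: S log S + S vertices."""
    return S * math.log2(S) + S


def min_sets(n, S):
    """Lower bound on P_D(S): total vertices divided by the per-set maximum."""
    return math.ceil(fft_vertices(n) / max_vertices_per_set(S))


def io_lower_bound(n, S):
    """Lemma 3.1 with P(2S) >= P_D(2S): Q >= S * (P_D(2S) - 1)."""
    return S * (min_sets(n, 2 * S) - 1)


n = 2 ** 20
for S in (4, 16):
    print(S, io_lower_bound(n, S))
# prints: 4 2752508
#         16 1834992
```

Under these assumptions, raising S from 4 to 16 lowers the bound only modestly, which reflects the logarithmic dependence on S in Corollary 4.1.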
5. Lower Bounds Based on Information Speed Functions

Many "regular" graphs G = (V, E) have the property that all inputs can reach all outputs through
vertex-disjoint paths. In the proof of Theorem 4.1 we have already noted that the FFT graph has this property. In
the current section, this type of graph will be considered. The vertex-disjoint paths from inputs to outputs
will be called lines, for simplicity. We say that the information speed function is Ω(F(d)) if for any two vertices
u, v on the same line that are d apart, there are at least F(d) vertices in the graph satisfying the following two
properties.

F1. At most one of these vertices can belong to a single line.

F2. Each of these vertices belongs to a path connecting u and v.

The following theorem shows that lower bounds on Q can be obtained from lower bounds on F or upper
bounds on F^{-1}.

Theorem 5.1. For any graph where all inputs can reach all outputs through vertex-disjoint
paths, if the information speed function is Ω(F(d)) where F is monotonically increasing and F^{-1}
exists, then

Q · F^{-1}(S) = Ω(L),

where L is the total number of vertices on the vertex-disjoint paths or the lines.

Proof: As in the proof of Theorem 4.1, we will establish

P_D(S) = Ω(L / (S · F^{-1}(S)))

by showing that any vertex set U in an S-dominator partition can have at most O(S · F^{-1}(S)) vertices
on the lines. Note that vertices in U can be on at most S lines, since the lines are vertex-disjoint
and U has a dominator set of size at most S. The theorem follows from the claim that on any line
there can be at most F^{-1}(S) + 1 vertices in U. Suppose that the claim is false for some line. Then
on this line there are two vertices u and v in U that are F^{-1}(S) + 1 apart. Consequently, there are
F(F^{-1}(S) + 1) vertices satisfying properties F1 and F2. If any of these vertices belongs to another
vertex set U′ in the S-dominator partition, then by property F2 there will be a cyclic dependence
among vertex sets in the S-dominator partition, violating property P4 in Section 3. Therefore all of
these F(F^{-1}(S) + 1) vertices, which form a set of more than S vertices, belong to U, and by property
F1 they belong to distinct lines. This is a contradiction, since vertices in U can be on at most S
lines. □

Corollary 5.1. For the odd-even transposition sorting network (see, e.g., [6]) for sorting
n-element runs,

Q · S = Ω(n^2),

for any S < n.

Proof: Consider the sub-network that includes only half of the inputs and outputs, as shown in
Figure 5-1. It is easy to see that we can assume the sub-network has n/2 lines with L = Θ(n^2) and
F(d) = d/2 for d < n. □

Figure 5-1: The odd-even transposition sorting network, where each "o" is a comparator.

Corollary 5.2. For the m×n snake-like directed mesh as shown in Figure 5-2,

Q = Ω(mn),

for any S < m.

Proof: Consider as lines all the horizontal vertex-disjoint paths from inputs to outputs. It is
easy to see that we can assume F(d) = m for any d ≥ 2. Let U be any vertex set in an S-dominator
partition of the graph. As in the proof of Theorem 5.1, we note that vertices in U can be on at
most S lines, and that on any line there can be at most two vertices in U. Therefore, U can have at
most O(S) vertices, and thus P_D(S) (or P(S)) = Ω(mn/S). The corollary follows from Lemma 3.1. □


Figure  5-2:  The  snake-like  directed  mesh. 

Straightforward decomposition schemes will show that lower bounds in the above corollaries are best
possible. We note that when S increases the I/O requirement Q for the odd-even transposition sorting
network decreases at the rate of 1/S, whereas that for the snake-like directed mesh remains essentially
unchanged. We say that graphs like the latter are indecomposable.
6. Independent Evaluation of Multivariate Expressions

Given values for indeterminates x_1, ..., x_q, the problem is to evaluate multivariate polynomial expressions
y_i = y_i(x_1, ..., x_q), i = 1, 2, ..., m. Assume that each y_i is a sum of at least two terms and in each y_i, all the
terms are distinct and have degrees ≤ D. An example of such a problem is matrix multiplication, where D = 2.
An independent evaluation of the y_i's is an algorithm or a directed acyclic graph with inputs x_i's and outputs y_i's
satisfying the following properties.

E1. In the evaluation of each y_i, all (and only) those product terms which appear in the fully
distributed expression of y_i are computed first by multiplications, and then using these product
terms y_i is formed through a summation tree by additions or subtractions only. In particular, no
multiplication can be performed after an addition or subtraction.

E2. Internal vertex sets of the summation trees for all the y_i's are disjoint from each other, that is, none
of the internal vertices in one tree appears as an internal vertex in another. (Thus, evaluations of
y_i's are independent from each other.)

Let X be any set of x_i's or products in x_i's. For any output y_i, define h(y_i, X) as the number of terms in y_i
that can be obtained from X directly or by multiplying elements in X. For any Y ⊆ {y_1, ..., y_m} we further
define

h(Y, X) = Σ_{y ∈ Y} h(y, X).

For example, if y_1 = x_1x_2 + x_1x_4, y_2 = x_1x_2 + x_1x_3, Y = {y_1, y_2}, and X = {x_1, x_2, x_3}, then h(y_1, X) = 1,
h(y_2, X) = 2, and h(Y, X) = 3. Define the S-combination number to be

H(S) = max{h(Y, X) | |Y| ≤ S, |X| ≤ S}.
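
The quantities h and H can be computed by brute force for tiny instances. The following sketch is a simplified illustration, not the paper's construction: it assumes that X contains only indeterminates (no pre-formed products), so a term is obtainable exactly when all of its variables lie in X. The names `h_single`, `h` and `H`, the two sample polynomials, and the 2×2 matrix product are hypothetical choices made for the example.

```python
from itertools import combinations


def h_single(y, X):
    """Number of terms of y whose variables all lie in X."""
    return sum(1 for term in y if set(term) <= X)


def h(Y, X):
    """h(Y, X): sum of h(y, X) over the outputs y in Y."""
    return sum(h_single(y, X) for y in Y)


def H(outputs, variables, S):
    """S-combination number by exhaustive search (only for tiny instances)."""
    best = 0
    # h never decreases when X grows, so sets of the largest allowed size suffice.
    for X in combinations(variables, min(S, len(variables))):
        Xs = set(X)
        # h is additive over Y, so the best Y takes the S largest counts.
        counts = sorted((h_single(y, Xs) for y in outputs), reverse=True)
        best = max(best, sum(counts[:S]))
    return best


# Two sample polynomials; a term is a tuple of variable indices.
y1 = [(1, 2), (1, 4)]          # x1*x2 + x1*x4
y2 = [(1, 2), (1, 3)]          # x1*x2 + x1*x3
X = {1, 2, 3}
print(h_single(y1, X), h_single(y2, X), h([y1, y2], X))  # 1 2 3

# 2x2 matrix product C = AB: c_ij = a_i0*b_0j + a_i1*b_1j.
variables = [("a", i, k) for i in range(2) for k in range(2)] + \
            [("b", k, j) for k in range(2) for j in range(2)]
outputs = [[(("a", i, k), ("b", k, j)) for k in range(2)]
           for i in range(2) for j in range(2)]
print(H(outputs, variables, 4))  # 4
```

The exhaustive search grows combinatorially with the number of variables, so it is only meant to make the definition tangible.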

We  have  the  following  result. 

Theorem 6.1. Suppose that H(S) = Ω(S). Then for any independent evaluation of a
multivariate expression of degree ≤ D,

Q · D · H(S)/S = Ω(|V|),

where |V| is the total number of vertices in the graph corresponding to the independent
evaluation.

Proof: Let {V_1, V_2, ..., V_h} be an S-partition of the graph associated with the independent
evaluation. We shall prove the following.

(i) Each V_i, 1 ≤ i ≤ h, can have at most H(S) + 2S internal vertices. (An internal vertex is
defined to be a vertex belonging to the internal vertex set of some summation tree.)

(ii) There are at least |V|/(2D) internal vertices in the graph.

By property P3 in the definition of S-partition, the minimum set of V_i has at most S vertices.
This implies that V_i can have nonempty intersections with internal vertex sets of at most S
summation trees, since by E2 each of such intersections has at least one distinct vertex in the
minimum set. Thus, to bound the number of internal vertices that V_i can have, we need only
consider summation trees for S y_i's. By property P2 of S-partition, we note that V_i has a dominator
set D_i of size no more than S. By the definition of H(S), from D_i one can form at most H(S) terms
appearing in the S y_i's. These terms, together with possible vertices in D_i that are already internal
vertices, can generate at most H(S) + 2S internal vertices. We have shown (i). To prove (ii), let A
be the total number of internal vertices in the graph corresponding to the independent evaluation.
Then the total number of external vertices, or terms, in all the summation trees, is no greater than
2A. Each product term requires at most D − 1 multiplications; thus the total number of vertices
|V| in the graph satisfies:

|V| ≤ 2A(D − 1) + A ≤ 2AD.

This proves (ii). It follows from (i) and (ii) that

h ≥ (|V|/2D) / (H(S) + 2S),

and by Lemma 3.1,

Q = Ω(S|V| / (D (H(2S) + 2S))).

The theorem follows from the assumption that H(S) = Ω(S). □

Corollary 6.1. For the ordinary matrix-vector multiplication algorithm for multiplying an m×n
matrix with an n-vector,

Q · S = Ω(mn),

assuming that entries in the matrix can be generated on-the-fly and thus are not required to be
input.

Proof: The corollary follows immediately by noting that H(S) = Θ(S^2) and D = 1. □

Lemma 6.1. For matrix-matrix multiplication,

H(S) = Θ(S^{3/2}).

Proof: We shall only prove H(S) = O(S^{3/2}), since it is trivial to see H(S) = Ω(S^{3/2}). Consider
the matrix multiplication, AB = C. Let W be any set of entries in A and B, with |W| ≤ S. Partition
A into two classes as follows. Class A_d consists of all rows in A, each of which has at least
√S entries in W, and class A′ consists of the rest of the rows in A. Accordingly, matrix C is
partitioned into two classes, A_d B and A′B. Since A_d can have at most √S rows, and since in any
row of A_d B an entry in B can appear at most once (and B has no more than S entries in W), the
maximum number of terms in A_d B that can be obtained by multiplying elements in W is at most
S · √S = S^{3/2}. For entries in A′B, each of them can be obtained by multiplying at most
√S elements in W, since each row in A′ has at most √S elements in W. Therefore, in any S
entries of A′B, there are at most S · √S = S^{3/2} terms that can be obtained by multiplying elements
in W. □

By Theorem 6.1 and Lemma 6.1, we have the following result.

Corollary 6.2. For the ordinary matrix-matrix multiplication algorithm for multiplying m×k
and k×n matrices,

Q · √S = Ω(mkn).
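
One way to see that the bound of Corollary 6.2 has the right shape is to count the I/O of a blocked (tiled) multiplication. The sketch below is an assumption-laden illustration rather than a scheme taken from the paper. It multiplies two n×n matrices using b×b tiles with 3b^2 ≤ S, so that one tile each of A, B and C fits in fast memory, assumes that b divides n, and counts one I/O per word moved. The function name `tiled_matmul_io` is hypothetical.

```python
import math


def tiled_matmul_io(n, S):
    """Words moved by a b-by-b tiled n-by-n matrix product with 3*b*b <= S."""
    b = math.isqrt(S // 3)          # largest tile side with 3*b*b <= S
    if b == 0:
        raise ValueError("need S >= 3 to hold one word of A, B and C")
    tiles = math.ceil(n / b)        # tiles per dimension (exact when b divides n)
    io = 0
    for i in range(tiles):
        for j in range(tiles):
            for k in range(tiles):
                io += 2 * b * b     # read one tile of A and one tile of B
            io += b * b             # write the finished tile of C
    return io


print(tiled_matmul_io(64, 48))      # 135168  (b = 4)
print(tiled_matmul_io(64, 192))     # 69632   (b = 8)
```

The count equals 2n^3/b + n^2, so quadrupling S (which doubles b) roughly halves the I/O, in line with the 1/√S dependence in the corollary.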
7. Lower Bounds for Products of Graphs

As demonstrated in Sections 4 and 5, one can establish lower bounds on Q by proving upper bounds on the
size of any vertex set that has a dominator set of size at most S. This is equivalent to proving lower bounds on

D(n) = the minimum size of a dominator set for any vertex set having no less than n vertices.

In this section we show that lower bounds on D(n) for the product of two graphs can be obtained from lower
bounds on D(n) for individual graphs. (See, for example, [3] for the definition of the product of two graphs.)
Let G_1 × G_2 be the product of G_1 and G_2. A vertex (v_1, v_2) ∈ G_1 × G_2 is defined to be an input (or output) of
G_1 × G_2 if v_1 is an input of G_1 or v_2 is an input of G_2 (or, respectively, v_1 is an output of G_1 and v_2 is an
output of G_2). Of course D(n) depends on the graph on which it is defined; we use D_1(n), D_2(n) and D(n) to
distinguish the cases when the graph is G_1, G_2 and G, respectively.

Lemma 7.1. If $f$ is a positive function such that $f(x)/x$ is non-increasing, $\sum_i a_i \ge T_1 T_2$, and
$0 \le a_i \le T_2$, then

$$\sum_i f(a_i) \ge T_1 f(T_2).$$

Proof:

$$\sum_i f(a_i) \ge \sum_i a_i f(T_2)/T_2 \ge T_1 f(T_2). \quad \Box$$
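
The inequality of Lemma 7.1 can be spot-checked numerically. The sketch below is only an illustration: the choice f = sqrt (positive on x > 0, with f(x)/x non-increasing), the random test data, and the helper name `lemma_7_1_gap` are assumptions of this example, and a passing check is of course not a proof.

```python
import math
import random

def lemma_7_1_gap(f, a, T1, T2):
    """Return sum f(a_i) - T1 * f(T2); Lemma 7.1 says this is >= 0."""
    assert all(0 < x <= T2 for x in a), "each a_i must lie in (0, T2]"
    assert sum(a) >= T1 * T2, "the a_i must sum to at least T1 * T2"
    return sum(f(x) for x in a) - T1 * f(T2)

random.seed(0)
gaps = []
for _ in range(1000):
    T2 = random.randint(1, 50)
    a = [random.randint(1, T2) for _ in range(random.randint(1, 40))]
    T1 = sum(a) // T2            # largest integer T1 with sum(a) >= T1 * T2
    gaps.append(lemma_7_1_gap(math.sqrt, a, T1, T2))

worst_gap = min(gaps)            # non-negative up to floating-point rounding
```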


Theorem 7.1. (The Product Theorem for Dominators)

If $D_i(n) = \Omega(d_i(n))$ where $d_i$, $i = 1, 2$, is a positive, non-decreasing function such that $d_i(x)/x$ is
non-increasing, then

$$D(n_1 n_2) = \Omega(\min\{n_1 \cdot d_2(n_2),\ n_2 \cdot d_1(n_1)\}).$$

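Proof: Let $W$ be a subset of $V_1 \times V_2$ of size $n_1 n_2$. Define

$U_2$ = the set of vertices $p_2$ in $V_2$ for which $|W \cap (V_1 \times \{p_2\})| > n_1$,
and

$U_2' = V_2 - U_2$.

Clearly, we have $|U_2| < n_2$, giving, for any $p_1 \in V_1$,

$$|W \cap (\{p_1\} \times U_2)| \le n_2, \qquad (5)$$

and for $p_2 \in U_2'$,

$$|W \cap (V_1 \times \{p_2\})| \le n_1. \qquad (6)$$

One of the following two cases must hold.


Case 1. $|W \cap (V_1 \times U_2)| \ge n_1 n_2 / 2$.

Let $p_1$ be any vertex in $V_1$. Any dominator set for $W \cap (\{p_1\} \times V_2)$ is of size at least
$d_2(|W \cap (\{p_1\} \times V_2)|)$. Thus the size of any dominator set for $W$ satisfies:

$$D(n_1 n_2) \ge \sum_{p_1 \in V_1} d_2(|W \cap (\{p_1\} \times V_2)|).$$

Since $U_2$ is a subset of $V_2$ and $d_2$ is a non-decreasing function, we have

$$D(n_1 n_2) \ge \sum_{p_1 \in V_1} d_2(|W \cap (\{p_1\} \times U_2)|).$$

By the definition of Case 1,

$$\sum_{p_1 \in V_1} |W \cap (\{p_1\} \times U_2)| \ge n_1 n_2 / 2. \qquad (7)$$

By Lemma 7.1, it follows from (5) and (7) that

$$\sum_{p_1 \in V_1} d_2(|W \cap (\{p_1\} \times U_2)|) \ge n_1 \cdot d_2(n_2)/2,$$

implying

$$D(n_1 n_2) \ge n_1 \cdot d_2(n_2)/2.$$

Case 2. $|W \cap (V_1 \times U_2')| > n_1 n_2 / 2$.

Let $p_2$ be any vertex in $V_2$. Any dominator set for $W \cap (V_1 \times \{p_2\})$ is of size at least
$d_1(|W \cap (V_1 \times \{p_2\})|)$. Thus the size of any dominator set for $W$ satisfies:

$$D(n_1 n_2) \ge \sum_{p_2 \in V_2} d_1(|W \cap (V_1 \times \{p_2\})|).$$

Since $U_2'$ is a subset of $V_2$, we have

$$D(n_1 n_2) \ge \sum_{p_2 \in U_2'} d_1(|W \cap (V_1 \times \{p_2\})|).$$

By the definition of Case 2,

$$\sum_{p_2 \in U_2'} |W \cap (V_1 \times \{p_2\})| > n_1 n_2 / 2. \qquad (8)$$

By Lemma 7.1, it follows from (6) and (8) that

$$\sum_{p_2 \in U_2'} d_1(|W \cap (V_1 \times \{p_2\})|) \ge n_2 \cdot d_1(n_1)/2,$$

implying

$$D(n_1 n_2) \ge n_2 \cdot d_1(n_1)/2. \quad \Box$$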
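
The case split used in the proof can be made concrete on small vertex sets. The following sketch assumes that W is given as a Python set of pairs (p1, p2); the function names `heavy_columns` and `proof_case` are hypothetical. It computes the set of columns holding more than n1 points of W, checks the cardinality facts the proof relies on, and reports which of the two cases applies.

```python
from collections import Counter

def heavy_columns(W, n1):
    """U_2: vertices p2 whose column V_1 x {p2} holds more than n1 points of W."""
    per_column = Counter(p2 for _, p2 in W)
    return {p2 for p2, count in per_column.items() if count > n1}

def proof_case(W, n1, n2):
    """Return (case, U2) for a set W of n1 * n2 pairs (p1, p2)."""
    assert len(W) == n1 * n2
    U2 = heavy_columns(W, n1)
    assert len(U2) < n2                                   # |U_2| < n_2
    per_row_in_U2 = Counter(p1 for p1, p2 in W if p2 in U2)
    assert all(c <= n2 for c in per_row_in_U2.values())   # inequality (5)
    in_U2 = sum(per_row_in_U2.values())                   # points of W inside V_1 x U_2
    return (1 if 2 * in_U2 >= n1 * n2 else 2), U2

# One column holds 4 > n1 points, so most of W sits in V_1 x U_2.
tall = {(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (0, 2)}
# Every column holds exactly n1 points, so U_2 is empty.
flat = {(p1, p2) for p1 in range(2) for p2 in range(3)}
```

With n1 = 2 and n2 = 3, the set `tall` falls under Case 1 and `flat` under Case 2.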

Let $L_1 = (V, E)$ be a directed line where $V = \{1, 2, \ldots, m\}$ and $E = \{(i, i+1) \mid i = 1, 2, \ldots, m-1\}$, with unique
input "1" and output "m". We have $D_{L_1}(n) = 1$ for any $n \le m$. See Figure 7-1.

Let $L_2 = L_1 \times L_1$. Then

$$D_{L_2}(n^2) = \Omega(\min\{1 \cdot n,\ 1 \cdot n\}),$$

giving

$$D_{L_2}(n^2) = \Omega(n).$$

Let $L_3 = L_1 \times L_2$. Then

$$D_{L_3}(n^3) = \Omega(\min\{n \cdot n,\ n^2 \cdot 1\}),$$

giving

$$D_{L_3}(n^3) = \Omega(n^2).$$


Figure  7-1:  The  product  of  two  directed  lines,  where  each  V  represents  an  input. 


Let $L_d = L_1 \times \cdots \times L_1$, that is, $L_d$ is the product of $d$ $L_1$'s. Then similarly,

$$D_{L_d}(n^d) = \Omega(n^{d-1}). \qquad (9)$$
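
The step from the two- and three-dimensional cases to (9) can be traced by tracking only the exponent of n. The sketch below is an illustration under the assumption that $L_k$ is built as $L_1 \times L_{k-1}$ and that Theorem 7.1 is applied with $n_1 = n$, $d_1 = 1$, $n_2 = n^{k-1}$ and $d_2(x) = x^{e/(k-1)}$, which is positive, non-decreasing, and has $d_2(x)/x$ non-increasing because $e \le k-1$. The function name is hypothetical.

```python
def dominator_exponent(d):
    """Exponent e with D_{L_d}(n^d) = Omega(n^e), by repeated use of Theorem 7.1."""
    e = 0                        # D_{L_1}(n) = 1 = n^0
    for k in range(2, d + 1):    # build L_k = L_1 x L_{k-1}
        # n1 = n, d1 = 1 (for L_1); n2 = n^(k-1), d2(n2) = n^e (for L_{k-1})
        first = 1 + e            # exponent of n1 * d2(n2)
        second = (k - 1) + 0     # exponent of n2 * d1(n1)
        e = min(first, second)
    return e

exponents = [dominator_exponent(d) for d in range(1, 7)]
```

Both terms of the minimum coincide at every step, which is why the exponent grows by exactly one per factor and reaches d - 1.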


Corollary 7.1. For the product $L_d$ with $d > 2$,

$$Q \cdot S^{1/(d-1)} = \Omega(m^d).$$

Proof: By (9), the maximum size of any vertex set that has a dominator set of size at most $S$ is
$O(S^{d/(d-1)})$. Since there are a total of $m^d$ vertices in $L_d$, we have

$$P(S) = \Omega(m^d / S^{d/(d-1)}),$$

by which the Corollary follows from Lemma 3.1. □
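
The exponent arithmetic behind this proof can be written out as follows. This is a sketch under one stated assumption: that Lemma 3.1, which is not reproduced in this section, yields a bound of the form $Q = \Omega(S \cdot P(2S))$. Here $N = n^d$ denotes the size of the vertex set and $d$ is treated as a constant.

```latex
\begin{align*}
D_{L_d}(N) = \Omega\bigl(N^{(d-1)/d}\bigr)
  &\;\Longrightarrow\; N = O\bigl(S^{d/(d-1)}\bigr)
     \text{ whenever the dominator set has size at most } S, \\
P(S) &= \Omega\bigl(m^d / S^{d/(d-1)}\bigr), \\
Q &= \Omega\bigl(S \cdot P(2S)\bigr)
   = \Omega\bigl(m^d \, S^{\,1 - d/(d-1)}\bigr)
   = \Omega\bigl(m^d / S^{1/(d-1)}\bigr).
\end{align*}
```

The last step uses $1 - d/(d-1) = -1/(d-1)$, which gives $Q \cdot S^{1/(d-1)} = \Omega(m^d)$.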

We  have  a  similar  product  theorem  for  separators  of  a  graph.  For  the  special  case  Ld,  bounds  on  the  sizes 
of  minimum  separators  have  been  established  by  A.  L.  Rosenberg  [9]. 


8.  Summary  and  Concluding  Remarks 

To compare I/O requirements for different algorithms, we propose the use of the following measure. The
decomposability factor $\lambda(S)$ of an algorithm or graph $G = (V, E)$ is defined to be the ratio between the
sequential time of the algorithm, that is $|V|$, and the minimum I/O time $Q$ when assuming $S$ red pebbles are
used. Thus,

$$\lambda(S) = |V| / Q.$$

For a given algorithm, $|V|$ is fixed. The larger $\lambda(S)$ is, the less I/O is required. A summary
of the results of this paper on specific algorithms or graphs, expressed in terms of bounds on $\lambda(S)$, is as follows:


Algorithms or Graphs                                   $\lambda(S)$

Matrix-vector multiplication (ordinary algorithm)      $\Theta(S)$
Odd-even transposition sorting network                 $\Theta(S)$
Matrix-matrix multiplication (ordinary algorithm)      $\Theta(\sqrt{S})$
$L_d$ ($d > 2$)                                        $\Theta(S^{1/(d-1)})$
FFT                                                    $\Theta(\log S)$
Snake-like directed mesh                               $\Theta(1)$
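
As a small worked example of the decomposability factor, the sketch below divides the sequential time by the I/O lower bound for three of the cases above: the FFT (bound taken from the abstract), matrix-matrix multiplication (Corollary 6.2), and the product of directed lines (Corollary 7.1). All constant factors are dropped, so only the growth rate in S is meaningful; the function names and the base-2 logarithm are assumptions of this example.

```python
import math

def lam_fft(n, S):
    V = n * math.log2(n)                   # sequential time, up to a constant
    Q = n * math.log2(n) / math.log2(S)    # I/O lower bound stated in the abstract
    return V / Q                           # grows like log S

def lam_matmul(m, k, n, S):
    V = m * k * n
    Q = m * k * n / math.sqrt(S)           # Corollary 6.2, constants dropped
    return V / Q                           # grows like sqrt(S)

def lam_line_product(m, d, S):
    V = m ** d
    Q = m ** d / S ** (1 / (d - 1))        # Corollary 7.1, constants dropped
    return V / Q                           # grows like S^(1/(d-1))
```

For instance, with S = 64 the three functions give values close to 6, 8 and (for d = 3) 8, independent of the problem size.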


It is also possible to establish upper bounds on $\lambda(S)$ for a class of algorithms for solving a given problem.
For example, it has been shown recently that for any sorting algorithm based on the decision tree model,
$\lambda(S) = O(\log S)$ [10].

The problem of establishing bounds on $\lambda(S)$ is closely related to several other graph partitioning problems.
We  intend  to  work  on  some  of  these  partitioning  problems  in  the  future,  and  show  how  they  are  related  to  the 
I/O  complexity  problem  addressed  in  this  paper. 